Heterogeneity in effect size estimates

Significance

In conducting empirical research in the social sciences, the results of testing the same hypothesis can vary depending on the population sampled, the study design, and the analysis. Such variation, referred to as heterogeneity, limits the generalizability of published scientific findings. We estimate heterogeneity based on 86 published meta-scientific studies. In our data, the estimated design and analytical heterogeneity are substantive and of at least the same magnitude as sampling uncertainty, whereas population heterogeneity is smaller. The results suggest low generalizability of the empirical findings under consideration across different study designs and statistical analyses.

Figures 2 and 3 in the paper (i.e., the effect size and 95% confidence interval per laboratory). We determined the standard error per lab by first dividing the difference between the effect size estimate and each of the lower and upper bounds of the 95% CI by the critical value of the normal distribution, and then averaging the two resulting standard error estimates. As the estimates and 95% CIs in the figures are reported to only two decimal places, our re-estimates are subject to some imprecision due to rounding; yet the discrepancies between the heterogeneity estimates reported in the paper and our re-estimates turn out to be negligible.

2. This registered replication report involves 21 independent direct replications of Study 7 in Rand, Greene, & Nowak [17]. We only consider the paper's primary analysis, i.e., the analysis without any exclusion criteria (intention-to-treat analysis). The paper reports Cochran's Q-test and I² but no confidence interval around I². All original data and analysis scripts are available at osf.io/9tpgx. While the heterogeneity estimates and the meta-analytic effect size are computationally reproducible, the 95% CI around the meta-analytic effect size differs slightly between Bouwmeester et al.'s report and our re-estimation.

3. This registered replication report involves 16 independent direct replications of Study 1 in Finkel et al. [18]. The paper reports meta-analyses on five dependent variables, all of which are included in our reanalysis. The paper reports Cochran's Q, τ, I², and H² for each meta-analysis, but no confidence intervals around τ, I², and H². Heterogeneity estimates are based on the Hartung-Knapp [19] estimator. All original data and analysis scripts are available at osf.io/3nz7j. Re-estimating the heterogeneity measures based on the original data yields slightly different estimates from those reported in the original paper for all dependent measures except "subjective commitment" (manipulation check); yet the discrepancies are negligible (less than two percentage points for I², and less than 0.02 units for H²).

4. The paper reports meta-analytic results based on 12 independent direct replications of Study 3 in Hart & Albarracín [20]. The paper reports meta-analyses on three dependent variables, all of which are included in our analysis. For each meta-analysis, the paper reports Cochran's Q-test, τ, I², and H², but no confidence intervals around τ, I², and H². All original data and analysis scripts are available at osf.io/hx7a4. All estimates reported in the original paper are computationally reproducible.
5. The paper reports a multi-lab (k = 23) preregistered replication of the ego-depletion paradigm reported in Sripada, Kessler, & Jonides [21]. The paper includes meta-analyses on various outcome measures; we only include what appear to be the main analyses of the paper: the meta-analyses on differences in reaction times (RT) and reaction time variability (RTV) for the full samples, but not the meta-analyses on self-reported outcomes. The paper reports Cochran's Q-test and I², but no confidence intervals around I². All original data and analysis scripts are available at osf.io/kh85v. While the raw data available in the original study's OSF repository appear to be consistent with what is shown in Figures 1 and 2 in the paper (except for the mapping of lab names and results, which seems to be mixed up in the forest plots in the paper), the heterogeneity estimates reported in the paper cannot be computationally reproduced: the re-estimated Q-statistic is slightly smaller for both analyses, implying that I² in both re-estimations is about five percentage points smaller than what is reported in the paper.

6. This registered replication report presents the results of 22 independent direct replications of Experiment 1 in Srull & Wyer [22]. We only consider the paper's two primary analyses, i.e., (i) judgments of Ronald's hostility and (ii) judgments of ambiguously hostile behaviors. The paper reports Cochran's Q, τ, and I², but no confidence intervals around τ and I². The original data and code are available at osf.io/mcvt7. All estimates reported in the original paper are computationally reproducible.
7. The paper reports the results of meta-analyses involving 23 independent direct replications of a variant of Experiment 4 in Dijksterhuis & van Knippenberg [23]. We only consider the paper's primary analysis. However, we do not include the meta-analyses on the moderation effect of gender (which is also referred to as a primary analysis in the paper). The paper reports Q, τ, I², and H² estimates, but no confidence intervals around τ, I², and H². All original data and analysis scripts are available at osf.io/fyptm. All estimates reported in the original paper are computationally reproducible.

8. This registered replication report involves 19 independent direct replications of Experiment 1 from Mazar, Amir, and Ariely [24]. We only consider the paper's primary analysis. The paper reports estimates for Cochran's Q, τ², and I², but no confidence intervals around τ² and I². The original data and code are available at osf.io/mcvt7. All estimates reported in the original paper are computationally reproducible.

9. This registered replication report comprises 17 independent direct replications of Study 1 from Strack, Martin, & Stepper [25]. The original paper does not report any statistics on heterogeneity but focuses only on the meta-analytic effect size estimate from a random-effects meta-analysis. All original data and analysis scripts are available at osf.io/h2f98. All estimates reported in the original paper are computationally reproducible.
11. ManyLabs3 comprises multi-site direct and conceptual replications of 10 effects involving data collections across 20-21 independent sites each. In our review, we only consider the 10 primary effects [26–35] examined in ManyLabs3, but not the three added interaction effects on elaboration likelihood [28], credentials and prejudice [31], and self-esteem and subjective distance [32]. The paper reports Cochran's Q-test and I² for each studied effect, but no confidence intervals around I². The original data and analysis scripts are available at osf.io/ct89g. For all but three studies [33–35], the original scripts actually implement meta-analyses based on correlation coefficients rather than standardized mean differences. For these studies, we base our re-estimations on correlation coefficients, since meta-analyses could not be implemented in terms of Cohen's d and η² due to a lack of preprocessed data. While the heterogeneity estimates for the three studies mentioned above (in Cohen's d units) are accurately reproducible, the estimates for all remaining meta-analyses differ quantitatively to some extent (by less than 5 percentage points in terms of I²) but are qualitatively similar.
12. ManyLabs1 comprises multi-site replications of 16 effects (from 12 original publications) [36–47] involving 34-35 data collections across 36 independent labs each. We include the 16 primary meta-analytic results in our review. For each of the studied effects, the paper reports Cochran's Q-test and I², but no confidence intervals around I². The original article points readers to the project's OSF repository (osf.io/wx7ck) for the original data and code; however, this repository does not contain the preprocessed data used for the meta-analyses. Yet, the Wiki of the 'Analysis Scripts' component includes a link to a GitHub repository, which includes the preprocessed data and the analysis scripts for the meta-analyses reported in the paper. As described in an erratum [48], the original article contained some errors. One of these errors pertains to the heterogeneity results for the 'allowed/forbidden' effect [37], which were flawed due to an error in the initial analysis code.

Klein et al. (2014)
The analysis scripts have been corrected, and our reanalysis for the effect matches the corrected statistics reported in the erratum. For another effect [47], there are discrepancies (not mentioned in the erratum): while the Q-statistic obtained from the original data and code matches the number reported in the paper, the I² statistic differs (20.1% in the paper vs. 28.1% obtained in a reanalysis using the original code). All remaining heterogeneity estimates were reproducible. However, there are very minor discrepancies in results here and there, likely due to typos and/or rounding errors.
All our re-estimates coincide with what is reported by the original authors in the file heterogeneity.pdf available in the project's GitHub repository (containing the log outputs generated when running the analysis script provided in the same repository).

13. ManyLabs2 comprises multi-site direct and conceptual replications of 28 effects (from 26 publications) [39], involving data collections across 41-66 independent sites each. The paper reports Cochran's Q-test, τ, and I² for each studied effect. The paper also reports 95% confidence intervals around I² but no confidence interval around τ. As per the paper, the data and code are available at osf.io/8cd4r; the Wiki of the OSF page indicates that the authors are aware of some bugs and points toward a GitHub repository to obtain reproducible data and code. We used the data and scripts available in this GitHub repo. In our review, we include 25 of the 28 meta-analyses; for three studies [71–73], the meta-analyses could not be reproduced as the preprocessed data are missing. These 25 studies are consistent with the meta-analytic results reported in the summary spreadsheet (meta_analysis_wide.xlsx) available in ML2's GitHub repo.

Klein et al. (2018)
For all studies included in our review, we only consider the meta-analyses on all samples without moderators. All but one of the Q-tests and I² statistics (including the 95% CI) could be reproduced based on the original data: for one study [49], the Q-statistic in the re-estimation (Q(48) = 10.02) differs from what is reported in the paper (Q(48) = 15.33), while the I² (and the 95% CI around I²) is consistent.
Furthermore, the τ estimates reported in the paper are erroneous, likely due to a "mix of rounding errors and sometimes erroneously copying τ²," as acknowledged by the authors in the "Bug Tracker" logs linked in the OSF repository's Wiki. Indeed, our re-estimates of τ are consistent with the original authors' results as reported in the summary spreadsheet (meta_analysis_wide.xlsx).

14. ManyLabs4 comprises multi-site replications of the theory by Greenberg et al. [74,75], involving data collections across 17 labs. While ten of the labs collected data using independently created "in-house" protocols, seven labs conducted the experiment using the "author-advised protocol." Since the variability in estimates for the "in-house" sample does not only pertain to population heterogeneity (but also involves potential design heterogeneity), we re-estimate heterogeneity using a random-effects meta-analysis for the "author-approved protocols" only (based on the seven labs that implemented this protocol; n = 699 participants in total). The primary results by Klein et al. comprise three meta-analyses based on three different sets of exclusion criteria; for our review, we only consider the results of "exclusion set 1." As per the paper, the data and code are available at osf.io/8ccnw; the Wiki of the OSF page recommends using the data and code provided via the project's GitHub repository instead. The paper reports Cochran's Q-test, but neither τ nor I², for a meta-analysis pooling both the "in-house" and the "author-advised protocols," as well as for a meta-analysis accounting for moderation effects of author-advised vs. in-house protocols; the paper does not report heterogeneity estimates based on a meta-analysis of the "author-approved protocol" sample only. Yet, the meta-analytic results based on all 17 estimates are computationally reproducible, as is the meta-analytic effect size estimate for the "author-approved protocol" sample.
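The standard-error recovery described for the first study in this list (dividing the distance between the estimate and each CI bound by the normal critical value, then averaging the two results) can be sketched as follows. This is a minimal illustration, not the authors' script; the function name `se_from_ci` and the example numbers are hypothetical.

```python
from statistics import NormalDist


def se_from_ci(estimate, lower, upper, level=0.95):
    """Approximate a standard error from a symmetric normal-theory CI.

    Hypothetical helper: computes one SE estimate from each CI bound
    and returns their average, as described in the text.
    """
    # Two-sided critical value of the standard normal (about 1.96 for 95%)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se_from_lower = (estimate - lower) / z
    se_from_upper = (upper - estimate) / z
    return (se_from_lower + se_from_upper) / 2


# Made-up lab result, rounded to two decimals as if read off a forest plot.
# Rounding makes the two half-widths differ (0.31 vs. 0.32); averaging
# the two SE estimates splits the difference.
example_se = se_from_ci(0.12, -0.19, 0.44)
```

Averaging the two bound-specific estimates dampens, but does not remove, the rounding imprecision noted above.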

Design Heterogeneity
To the best of our knowledge, there are only two studies [76,77] that allow for systematic variation in effect size estimates pertaining to the same research question while ruling out other sources of potential variation to isolate design heterogeneity. Both studies are included in our analysis of design heterogeneity.

76. The study reports the results of a meta-analysis of 45 research designs proposed by independent teams. The original data and analysis scripts are available at osf.io/fyme2. We only consider heterogeneity estimates for analytic approach B, which rules out analytical heterogeneity by applying the same analysis to all experimental designs. All estimates relevant to our analyses are reported in the original paper: the Q-statistic, τ together with its 95% CI, and I² together with its 95% CI. As in the original article, τ² is estimated using the DerSimonian-Laird [78] estimator. All estimates reported in the original paper are computationally reproducible in R, despite the original analysis having been carried out in Stata.
77. The study reports the meta-analytic results on five independent hypotheses comprising 12-13 research designs proposed by independent research teams. Data for the main studies and replications were collected in two populations for each of the five hypotheses. The original data and analysis scripts are available at osf.io/9jzy4. We consider both the main studies (for which data were collected on Amazon Mechanical Turk) and the replications (for which data were collected via PureProfile). All estimates relevant to our review (Q-test, τ² together with its 95% CI, and I² together with its 95% CI) are reported in the paper. We re-estimated the meta-analyses for the five hypotheses using the original data and code. The regenerated results differ quantitatively from the results reported in Table 2 of the original paper, yet they are qualitatively similar. For hypothesis 5, the meta-analytic results reported in the paper pertain to correlation coefficients, but the original study's replication kit also includes a meta-analysis based on effect sizes transformed into Cohen's d units; for our analysis, we rely on the latter so that all meta-analytic estimates for the study are in Cohen's d units.
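For readers unfamiliar with the heterogeneity statistics referred to throughout (Cochran's Q, τ², I², and H²), the following is a minimal sketch of a random-effects meta-analysis with the DerSimonian-Laird estimator. It uses the standard textbook formulas and is not the code of any of the reviewed studies; the function name and the toy inputs are hypothetical, and confidence intervals around τ² and I² are omitted.

```python
import math


def dersimonian_laird(effects, std_errors):
    """Random-effects meta-analysis with the DerSimonian-Laird estimator.

    Assumes at least two studies with strictly positive standard errors.
    Returns Q, tau2, I2 (as a proportion), H2, and the pooled estimate
    with its standard error.
    """
    k = len(effects)
    df = k - 1
    w = [1.0 / se**2 for se in std_errors]  # inverse-variance weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))

    # Method-of-moments estimate of between-study variance, truncated at 0
    c = sum_w - sum(wi**2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)

    # I2 = (Q - df) / Q and H2 = Q / df, truncated at 0 and 1 respectively
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    h2 = max(1.0, q / df)

    # Random-effects pooled estimate with weights 1 / (se^2 + tau2)
    w_re = [1.0 / (se**2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))

    return {"Q": q, "tau2": tau2, "I2": i2, "H2": h2,
            "pooled": pooled, "se_pooled": se_pooled}


# Toy input: three designs with equal precision and spread-out estimates
result = dersimonian_laird([0.1, 0.3, 0.5], [0.1, 0.1, 0.1])
```

In this toy case Q = 8 on 2 degrees of freedom, so τ² = 0.03, I² = 0.75, and H² = 4. Dedicated meta-analysis software offers further estimators and interval methods, so small numerical differences from such a hand computation are to be expected.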

Analytical Heterogeneity
To gauge the extent of analytical heterogeneity, we include three multi-analyst studies [79–81]. Note that none of the multi-analyst studies included in our review report the results of a random-effects meta-analysis, since estimating meta-analytic models on multi-analyst studies is unconventional (we refer to the main text for a discussion).

79. The study reports the results of 120 independent analysts using the same dataset to test two empirical claims. The original data and code are available at osf.io/bw863. As in the primary analysis of the paper, we only consider the n = 99 and n = 101 results for research questions 1 and 2, respectively, that are reported in terms of standardized β coefficients.

80. The study reports the results of seven independent analysts on two causal empirical relationships. The original data and analysis scripts are available at osf.io/5637t. The original paper introduces the ratio between the standard deviation of effect size estimates across analysts and the mean standard error as a measure of analytical heterogeneity.

Huntington-Klein et al. (2021)
We report the estimates of this alternative measure for all meta-analyses included in our review in Supplementary Tables S1 and S2.

81. The study reports the results of 29 independent research teams using the same dataset to test the same hypothesis. The original data and code are available at osf.io/fa743. The data report the effect size estimates per team (in terms of odds ratios) alongside the corresponding 95% CIs, but no standard errors. We transform the point estimates into log-odds ratios and back out standard errors by first log-transforming the lower and upper bounds of the 95% CI and then dividing the (absolute) difference between the log-transformed effect size estimate and the log-transformed lower/upper bound by the critical value of the normal distribution.
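The conversion from odds ratios with 95% CIs to log-odds ratios with standard errors can be sketched as below. The function name and the example values are hypothetical. Averaging the SE estimates obtained from the lower and the upper bound is an assumption carried over from the procedure used for forest-plot data; the text does not state how the two are combined here.

```python
import math
from statistics import NormalDist


def log_odds_ratio_with_se(odds_ratio, ci_lower, ci_upper, level=0.95):
    """Convert an odds ratio and its CI to a log-odds ratio and an SE.

    Hypothetical helper. All inputs must be strictly positive.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    log_or = math.log(odds_ratio)
    # The CI is (approximately) symmetric on the log scale, so the
    # bounds are log-transformed before taking differences.
    se_from_lower = abs(log_or - math.log(ci_lower)) / z
    se_from_upper = abs(math.log(ci_upper) - log_or) / z
    # Assumption: average the two bound-specific SE estimates.
    return log_or, (se_from_lower + se_from_upper) / 2


# Made-up team result: OR = 1.31 with 95% CI [1.09, 1.57]
log_or, se = log_odds_ratio_with_se(1.31, 1.09, 1.57)
```

Working on the log scale matters because a CI for an odds ratio is asymmetric around the point estimate on the original scale; dividing raw differences by the critical value would give two clearly different standard errors.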

Silberzahn et al. (2018)
Table S2. Alternative measures quantifying heterogeneity in multi-analyst studies. For each multi-analyst study in our review (i.e., studies pertaining to analytical heterogeneity), we report the standard deviation of effect sizes across analysts, SD(yi), as a proxy for the between-study variation, and the mean of the standard errors of the effect size estimates, M(sei), as a proxy for the average within-study variation. The relative measure to quantify heterogeneity proposed by Huntington-Klein et al. [80], denoted as HRP, is defined as HRP = SD(yi) ÷ M(sei). In addition, we report the "H-equivalent" based on HRP, denoted as HP, defined as HP = (1 + HRP²)^0.5. To facilitate comparability, the table also reports estimates for the between-study variation (τ), the within-study variation (σ), the heterogeneity measure H, and the alternative heterogeneity measure HR = τ ÷ σ obtained from random-effects meta-analyses.

Table S2 column groups: Random-Effects Meta-Analysis; Huntington-Klein et al.'s Proxies.
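The two proxy measures defined in the caption of Table S2, HRP = SD(yi) ÷ M(sei) and HP = (1 + HRP²)^0.5, can be computed as in the following minimal sketch. The function name and the toy inputs are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the caption does not specify the denominator.

```python
import math
from statistics import mean, stdev


def huntington_klein_proxies(effects, std_errors):
    """Compute SD(y_i), M(se_i), H_RP, and the H-equivalent H_P.

    Hypothetical helper following the definitions in the table caption.
    """
    sd_y = stdev(effects)     # assumption: sample SD across analysts
    m_se = mean(std_errors)   # mean standard error across analysts
    h_rp = sd_y / m_se        # ratio of between- to within-variation proxy
    h_p = math.sqrt(1.0 + h_rp**2)  # analogous to H = sqrt(1 + (tau/sigma)^2)
    return {"SD_y": sd_y, "M_se": m_se, "H_RP": h_rp, "H_P": h_p}


# Toy input: three analysts' estimates with identical standard errors
proxies = huntington_klein_proxies([0.1, 0.3, 0.5], [0.1, 0.1, 0.1])
```

Unlike τ from a random-effects model, SD(yi) does not subtract the variation expected from sampling error alone, so HRP and HP are not directly interchangeable with HR and H and are best compared side by side.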